Facial Emotion Recognition (FER) is an integral part of behavioral analysis, intelligent monitoring systems, and human-computer interaction. This article proposes a CNN-LSTM-based facial emotion recognition system that works on static images, recorded video, and live webcam feeds. The first step detects the face region and applies image normalization and scaling. A CNN then extracts spatial information from each face, and an LSTM models the temporal information across frames to produce stable predictions. The approach is validated on the FER2013 dataset, which contains real-world factors such as occlusion, pose variation, and lighting changes. The experimental results indicate that the proposed approach outperforms the existing CNN-based approach in recall and F1-score, while its accuracy is on par. Because it visualizes emotions in real time, the approach is suitable for real-time applications such as affective computing and surveillance.
Introduction
Facial Emotion Recognition (FER) detects human emotions from facial expressions. It is an important technology in human-computer interaction, healthcare, surveillance, and intelligent monitoring systems.
1. Background and Challenges
Traditional FER methods rely on:
Handcrafted features such as Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG), and facial landmarks
Classical machine learning models like SVM, K-NN, and Random Forest
However, these approaches struggle in real-world conditions due to variations in:
Lighting (illumination)
Facial orientation (pose)
Occlusion
Expression diversity
Although Convolutional Neural Networks (CNNs) significantly improve accuracy by automatically learning features, they require large datasets and high computational power, limiting real-world deployment.
2. Motivation for the Proposed System
To overcome the limitations of both traditional and CNN-only methods, this work proposes a hybrid Facial Emotion Recognition system that combines:
CNN (for spatial feature extraction)
LSTM (for temporal sequence learning in videos)
This allows the system to work effectively on images, videos, and real-time webcam input.
3. Related Work Overview
Prior work that this system builds on includes:
ResNet [1], Inception [7], Xception [4], AlexNet [6] → Improved deep learning architectures for better accuracy and efficiency
LSTM networks [3] → Enabled modeling of temporal emotion changes in videos
Transfer learning models (e.g., FaceNet-based systems [14]) → Improved adaptation to emotion tasks
Viola-Jones algorithm [11] → Fast face detection used in preprocessing
TensorFlow [8] and Scikit-learn [9] → Common frameworks for implementing deep and classical models
Adam optimizer [10] → Improved training speed and convergence
3D CNNs [13] and multi-network systems [15] → Enhanced spatial-temporal emotion recognition performance
4. Proposed Method
The system is a CNN–LSTM hybrid model designed for real-time emotion recognition.
A. Preprocessing
Face detection using Haar cascade
Conversion to grayscale
Normalization to reduce lighting effects
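The grayscale and normalization steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the system's actual preprocessing code: the face box is assumed to come from a separate Haar cascade detector and is simply passed in, the 48x48 output size is chosen to match FER2013 images, and per-image standardization is one plausible reading of "normalization". The function names are hypothetical.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB array to grayscale using ITU-R BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(np.float64) @ weights

def resize_nearest(img, size):
    """Resize a 2-D array to size x size with nearest-neighbour sampling."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]

def preprocess_face(frame_rgb, box, size=48):
    """Crop the face box (x, y, w, h), convert to grayscale, resize, and standardize.

    Assumption: `box` is supplied by an external face detector.
    Assumption: zero-mean, unit-variance scaling stands in for the
    normalization step, since it cancels global brightness and contrast shifts.
    """
    x, y, w, h = box
    face = to_grayscale(frame_rgb[y:y + h, x:x + w])
    face = resize_nearest(face, size)
    return (face - face.mean()) / (face.std() + 1e-8)
```

Because standardization removes any global gain and offset, a uniformly darker or brighter copy of the same face maps to nearly the same output, which is the sense in which this step reduces lighting effects.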
B. CNN Feature Extraction
The CNN processes each preprocessed face image to extract spatial features using:
Convolution layers (feature extraction)
ReLU activation (non-linearity)
Pooling layers (dimension reduction)
Flattening (conversion into feature vector)
This step captures facial structures and expression patterns.
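The four operations listed above can be illustrated with a single-layer NumPy sketch. It is a toy version for exposition only: the kernels here would be learned during training, the real network presumably stacks several such layers, and the kernel size and 2x2 pooling window are assumptions not stated in the text. As in most deep learning frameworks, the "convolution" is implemented as cross-correlation (no kernel flip).

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def extract_features(image, kernels):
    """Apply conv -> ReLU -> pool for each kernel, then flatten into one vector."""
    maps = [max_pool(relu(conv2d(image, k))) for k in kernels]
    return np.concatenate([m.ravel() for m in maps])
```

For a 48x48 input and two 3x3 kernels, each feature map is 46x46 after convolution and 23x23 after pooling, so the flattened vector has 2 x 529 = 1058 entries.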
C. LSTM Temporal Modeling
For videos or live streams:
CNN features from consecutive frames form a sequence
LSTM learns how emotions evolve over time using:
Forget gate
Input gate
Output gate
Cell state updates
This improves stability and consistency in predictions across frames.
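To make the gate mechanics concrete, the following NumPy sketch implements one standard LSTM step and runs it over a sequence of per-frame feature vectors. It follows the common textbook formulation and is not taken from the system's implementation; the weight layout (gates stacked in the order forget, input, output, candidate) and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: (D,) CNN feature vector of the current frame
    h_prev, c_prev: (H,) previous hidden and cell state
    W: (4H, D), U: (4H, H), b: (4H,) stacked as [forget, input, output, candidate]
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])          # forget gate: how much old cell state to keep
    i = sigmoid(z[H:2 * H])      # input gate: how much new information to write
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate values for the cell state
    c = f * c_prev + i * g       # cell state update
    h = o * np.tanh(c)           # new hidden state
    return h, c

def run_lstm(sequence, W, U, b):
    """Process a sequence of frame features and return the final hidden state."""
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x in sequence:
        h, c = lstm_step(x, h, c, W, U, b)
    return h
```

The final hidden state summarizes the whole window of frames, so a single noisy frame has less influence on the prediction than it would in a frame-by-frame classifier.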
D. Emotion Classification
Final classification is performed using a Softmax layer
Model is trained using categorical cross-entropy loss
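The softmax output and the categorical cross-entropy loss can be written in a few lines of NumPy. This is a generic sketch of the two formulas rather than the training code itself; the seven-class label list reflects the FER2013 emotion categories, and the `eps` guard against log(0) is an added assumption.

```python
import numpy as np

# FER2013 emotion categories (class indices 0 to 6).
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(probs, one_hot, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot labels."""
    return float(-np.mean(np.sum(one_hot * np.log(probs + eps), axis=-1)))

def predict_emotion(logits):
    """Return the label with the highest softmax probability for one sample."""
    return EMOTIONS[int(np.argmax(softmax(logits)))]
```

With all logits equal, every class receives probability 1/7 and the loss equals ln 7, which is a useful sanity check for an untrained model.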
5. Key Idea and Advantage
The hybrid CNN–LSTM approach:
Uses CNN for spatial understanding of facial expressions
Uses LSTM for temporal consistency across video frames
Result:
Compared to frame-by-frame CNN methods, it produces more robust and stable emotion recognition, especially in real-time and video-based scenarios, with higher recall and F1-score at comparable accuracy.
References
[1] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” Proc. IEEE CVPR, 2016.
[2] O. M. Parkhi, A. Vedaldi and A. Zisserman, “Deep face recognition,” British Machine Vision Conference (BMVC), 2015.
[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[4] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” Proc. IEEE CVPR, 2017.
[5] G. Levi and T. Hassner, “Emotion recognition in the wild via convolutional neural networks and mapped binary patterns,” International Conference on Multimodal Interaction, 2015.
[6] A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, 2012.
[7] C. Szegedy et al., “Going deeper with convolutions,” Proc. IEEE CVPR, 2015.
[8] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” USENIX Symposium on Operating Systems Design and Implementation, 2016.
[9] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[10] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations (ICLR), 2015.
[11] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” Proc. IEEE CVPR, 2001.
[12] I. J. Goodfellow, Y. Bengio and A. Courville, Deep Learning. MIT Press, 2016.
[13] B. Hasani and M. H. Mahoor, “Facial expression recognition using enhanced deep 3D convolutional neural networks,” Proc. IEEE CVPR Workshops, 2017.
[14] H. Ding, S. Zhou and R. Chellappa, “FaceNet2ExpNet: Regularizing a deep face recognition net for expression recognition,” IEEE International Conference on Automatic Face & Gesture Recognition, 2017.
[15] Z. Yu and C. Zhang, “Image based static facial expression recognition with multiple deep network learning,” International Conference on Multimodal Interaction, 2015.